    An Ultra-Low-Energy, Variation-Tolerant FPGA Architecture Using Component-Specific Mapping

    Get PDF
    As feature sizes scale toward atomic limits, parameter variation continues to increase, leading to increased margins in both delay and energy. Parameter variation both slows down devices and causes devices to fail. For applications that require high performance, the possibility of very slow devices on critical paths forces designers to reduce clock speed in order to meet timing. For an important and emerging class of applications that target energy-minimal operation at the cost of delay, the impact of variation-induced defects at very low voltages mandates the sizing up of transistors and operation at higher voltages to maintain functionality. With post-fabrication configurability, FPGAs have the opportunity to self-measure the impact of variation, determining the speed and functionality of each individual resource. Given that information, a delay-aware router can use slow devices on non-critical paths, fast devices on critical paths, and avoid known defects. By mapping each component individually and customizing designs to a component's unique physical characteristics, we demonstrate that we can eliminate delay margins and reduce energy margins caused by variation. To quantify the potential benefit we might gain from component-specific mapping, we first measure the margins associated with parameter variation, and then focus primarily on the energy benefits of FPGA delay-aware routing over a wide range of predictive technologies (45 nm--12 nm) for the Toronto20 benchmark set. We show that relative to delay-oblivious routing, delay-aware routing without any significant optimizations can reduce minimum energy/operation by 1.72x at 22 nm. We demonstrate how to construct an FPGA architecture specifically tailored to further increase the minimum energy savings of component-specific mapping by using the following techniques: power gating, gate sizing, interconnect sparing, and LUT remapping. With all optimizations considered we show a minimum energy/operation savings of 2.66x at 22 nm, or 1.68--2.95x when considered across 45--12 nm. As there are many challenges to measuring resource delays and mapping per chip, we discuss methods that may make component-specific mapping more practical. We demonstrate that a simpler, defect-aware routing achieves 70% of the energy savings of delay-aware routing. Finally, we show that without variation tolerance, scaling from 16 nm to 12 nm results in a net increase in minimum energy/operation; component-specific mapping, however, can extend minimum energy/operation scaling to 12 nm and possibly beyond.
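
    The central mechanism here is a router that weighs each routing resource by its measured, per-chip delay rather than a worst-case corner, and that treats known-bad resources as unusable. The sketch below is only an illustration of that idea in Python, not the paper's router or cost function; the graph, delay, and defect inputs are assumed to come from the self-measurement step the abstract describes.

        import heapq

        def delay_aware_route(graph, delay, defective, src, sink):
            """Minimum-measured-delay path from src to sink (Dijkstra).

            graph     : dict mapping node -> iterable of neighbor nodes
            delay     : dict mapping node -> measured delay of that resource (ps)
            defective : set of nodes known to be non-functional; never used
            """
            dist = {src: delay[src]}
            prev = {}
            pq = [(delay[src], src)]
            while pq:
                d, node = heapq.heappop(pq)
                if node == sink:
                    path = [sink]                 # reconstruct sink -> src
                    while path[-1] != src:
                        path.append(prev[path[-1]])
                    return d, path[::-1]
                if d > dist.get(node, float("inf")):
                    continue
                for nxt in graph[node]:
                    if nxt in defective:
                        continue                  # skip known-bad resources entirely
                    nd = d + delay[nxt]           # per-chip measured delay, not a corner
                    if nd < dist.get(nxt, float("inf")):
                        dist[nxt] = nd
                        prev[nxt] = node
                        heapq.heappush(pq, (nd, nxt))
            return float("inf"), []               # no defect-free path exists

    A delay-oblivious router would charge every resource the same worst-case corner delay; the gap between the clock period that forces and the measured critical path found above is the margin the paper sets out to eliminate.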

    Packet Switched vs. Time Multiplexed FPGA Overlay Networks

    Get PDF
    Dedicated, spatially configured FPGA interconnect is efficient for applications that require high throughput connections between processing elements (PEs) but with a limited degree of PE interconnectivity (e.g. wiring up gates and datapaths). Applications which virtualize PEs may require a large number of distinct PE-to-PE connections (e.g. using one PE to simulate 100s of operators, each requiring input data from thousands of other operators), but with each connection having low throughput compared with the PE’s operating cycle time. In these highly interconnected conditions, dedicating spatial interconnect resources for all possible connections is costly and inefficient. Alternatively, we can time share physical network resources by virtualizing interconnect links, either by statically scheduling the sharing of resources prior to runtime or by dynamically negotiating resources at runtime. We explore the tradeoffs (e.g. area, route latency, route quality) between time-multiplexed and packet-switched networks overlaid on top of commodity FPGAs. We demonstrate modular and scalable networks which operate on a Xilinx XC2V6000-4 at 166 MHz. For our applications, time-multiplexed, offline scheduling offers up to a 63% performance increase over online, packet-switched scheduling for equivalent topologies. When applying designs to equivalent area, packet-switching is up to 2× faster for small area designs while time-multiplexing is up to 5× faster for larger area designs. When limited to the capacity of an XC2V6000, if all communication is known, time-multiplexed routing outperforms packet-switching; however, when the active set of links drops below 40% of the potential links, packet-switched routing can outperform time-multiplexing.
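
    The offline alternative the abstract evaluates amounts to assigning every logical link a time slot on the physical channels it shares with other links, before the design ever runs. The following is a minimal greedy sketch of that kind of slot assignment; the function and argument names are illustrative, not the paper's scheduler, which also chooses the routes themselves.

        import itertools

        def schedule_time_slots(logical_links, route_of):
            """Greedy offline scheduler for a time-multiplexed overlay.

            logical_links : iterable of logical link identifiers
            route_of      : dict mapping link -> list of physical channels it uses
            Assigns each link the earliest time slot in which none of its
            channels is already occupied; returns (slot per link, schedule length).
            """
            busy = {}                             # (channel, slot) -> occupying link
            slot_of = {}
            for link in logical_links:
                for slot in itertools.count():
                    if all((ch, slot) not in busy for ch in route_of[link]):
                        for ch in route_of[link]:
                            busy[(ch, slot)] = link
                        slot_of[link] = slot
                        break
            length = 1 + max(slot_of.values()) if slot_of else 0
            return slot_of, length

    A packet-switched overlay skips this offline step and resolves channel contention cycle by cycle in hardware, which is why it can win when only a small fraction of the potential links is actually active.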

    GraphStep: A System Architecture for Sparse-Graph Algorithms

    Get PDF
    Many important applications are organized around long-lived, irregular sparse graphs (e.g., data and knowledge bases, CAD optimization, numerical problems, simulations). The graph structures are large, and the applications need regular access to a large, data-dependent portion of the graph for each operation (e.g., the algorithm may need to walk the graph, visiting all nodes, or propagate changes through many nodes in the graph). On conventional microprocessors, the graph structures exceed on-chip cache capacities, making main-memory bandwidth and latency the key performance limiters. To avoid this “memory wall,” we introduce a concurrent system architecture for sparse graph algorithms that places graph nodes in small distributed memories paired with specialized graph processing nodes interconnected by a lightweight network. This gives us a scalable way to map these applications so that they can exploit the high-bandwidth and low-latency capabilities of embedded memories (e.g., FPGA Block RAMs). On typical spreading activation queries on the ConceptNet Knowledge Base, a sample application, this translates into an order of magnitude speedup per FPGA compared to a state-of-the-art Pentium processor.
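
    As a rough software caricature of the sample application, one spreading-activation step pushes weighted activation from sufficiently active nodes along their out-edges and accumulates it at the destinations. The sketch below is sequential and illustrative only; in the architecture described above, each processing node would perform these updates solely for the graph nodes held in its local embedded memory.

        def spreading_activation_step(adj, weight, activation, threshold=0.5):
            """One synchronous spreading-activation step over a sparse graph.

            adj        : dict node -> list of successor nodes (every node is a key)
            weight     : dict (src, dst) -> edge weight
            activation : dict node -> current activation level
            """
            incoming = {n: 0.0 for n in adj}
            for src in adj:
                level = activation.get(src, 0.0)
                if level < threshold:
                    continue                      # only sufficiently active nodes fire
                for dst in adj[src]:
                    incoming[dst] += level * weight[(src, dst)]
            # Each node's new state depends only on its old state and its reduced
            # incoming messages, which is what lets graph nodes live in small,
            # distributed memories next to the processing elements.
            return {n: activation.get(n, 0.0) + incoming[n] for n in adj}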

    Time-Multiplexed FPGA Overlay Networks on Chip

    Get PDF
    How do we design a communication network for processing elements (PEs) on a single chip that minimizes application communication time and area? In designing such a network it is essential to use a network communication pattern that matches application communication and area requirements. This report characterizes the design space of a particular communication pattern for networks on chip: Time-Multiplexed Interconnect. In contrast to more commonly used packet-switched networks, which route communication dynamically, time-multiplexed networks schedule all possible communication prior to runtime with an offline router. We describe how to build well-engineered, highly scalable time-multiplexed FPGA networks in terms of topology selection, routing algorithm design, and hardware design that operate on a Xilinx XC2V6000-4 at 166 MHz. To benchmark our networks we use real, communication-rich applications instead of generating synthetic traffic. We show that over all areas (10K–10M slices) and over all applications the best "one topology fits all" is the Butterfly Fat Tree (BFT) with c = 1, p = 0.5, which requires, in the worst case, 6.1x as many cycles to route communication as the optimal topology. We compare time-multiplexing to packet-switching, and show that on average, over all applications for all equivalent topologies, online packet-switched communication requires 1.5x as many cycles to route as offline time-multiplexed scheduling. When applying designs to equivalent area, for areas 100K slices packet-switching requires up to 3.4x as many cycles to route as time-multiplexing in the worst case. Finally, for equivalent, large networks (>100 PEs) time-multiplexing outperforms packet-switching for communication loads where greater than 10% of all logical links are active. This demonstrates that well-designed time-multiplexed FPGA overlay networks can deliver performance and area efficiency exceeding that of packet-switched networks.
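
    The BFT(c, p) parameters control how channel capacity grows from the PEs toward the root of the tree. As a rough illustration (an approximation, not the exact construction used in the report), the channel count at tree level l can be modeled as growing like c·2^(p·l), so p = 0.5 roughly doubles bandwidth every second level instead of every level.

        import math

        def bft_channels(level, c=1, p=0.5):
            """Approximate parallel channel count at a BFT tree level.

            Level 0 is the leaves (PEs).  With p = 0.5 the channel count roughly
            doubles every second level, far below the full bisection bandwidth
            a p = 1.0 fat tree would provide.
            """
            return max(1, math.ceil(c * 2 ** (p * level)))

        if __name__ == "__main__":
            for lvl in range(8):
                print(f"level {lvl}: ~{bft_channels(lvl)} channels")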

    Limit study of energy & delay benefits of component-specific routing

    No full text
    As feature sizes scale toward atomic limits, parameter variation continues to increase, leading to increased margins in both delay and energy. The possibility of very slow devices on critical paths forces designers to increase transistor sizes, reduce clock speed and operate at higher voltages than desired in order to meet timing. With post-fabrication configurability, FPGAs have the opportunity to use slow devices on non-critical paths while selecting fast devices for critical paths. To understand the potential benefit we might gain from component-specific mapping, we quantify the margins associated with parameter variation in FPGAs over a wide range of predictive technologies (45 nm-12 nm) and gate sizes and show how these margins can be significantly reduced by delay-aware, component-specific routing. For the Toronto 20 benchmark set, we show that component-specific routing can eliminate delay margins induced by variation and reduce energy for energy-minimal designs by 1.42-1.98×. We further show that these benefits increase as technology scales.
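
    A toy Monte Carlo experiment makes the margin argument concrete: if every device's delay is drawn from a distribution, a margined design must clock for a pessimistic corner, while a component-specific design only has to cover the slowest path actually present on that chip. The numbers and distribution below are illustrative placeholders, not the predictive-technology models used in the paper.

        import random

        def margin_vs_measured(n_paths=1000, devices_per_path=10,
                               nominal=100.0, sigma_frac=0.1, seed=0):
            """Toy comparison of a margined clock period vs. a measured one.

            Each path is a chain of devices whose delays are drawn from a normal
            distribution.  A margined design clocks for a +3-sigma-per-device
            corner; a component-specific design clocks for the slowest path
            actually measured on this particular chip.
            """
            rng = random.Random(seed)
            corner = devices_per_path * nominal * (1 + 3 * sigma_frac)
            measured = max(
                sum(rng.gauss(nominal, sigma_frac * nominal)
                    for _ in range(devices_per_path))
                for _ in range(n_paths))
            return corner, measured, corner / measured

        if __name__ == "__main__":
            corner, measured, margin = margin_vs_measured()
            print(f"corner period {corner:.0f} ps, measured critical path "
                  f"{measured:.0f} ps, margin {margin:.2f}x")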

    A Resilience Roadmap

    No full text
    Technology scaling has an increasing impact on the resilience of CMOS circuits. This outcome is the result of (a) increasing sensitivity to various intrinsic and extrinsic noise sources as circuits shrink, and (b) a corresponding increase in parametric variability causing behavior similar to what would be expected with hard (topological) faults. This paper examines the issue of circuit resilience, then proposes and demonstrates a roadmap for evaluating fault rates starting at the 45 nm node and going down to the 12 nm node. The complete infrastructure necessary to make these predictions is placed in the open-source domain, with the hope that it will invigorate research in this area.
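
    One small piece of such a roadmap is turning a variability assumption into a fault-rate estimate. The sketch below shows the general flavor only: it models threshold-voltage variation as a normal distribution, computes the probability that a device falls outside a fixed guardband, and converts that into an expected fraction of fully working chips. The sigma values and guardband are hypothetical placeholders, not numbers from the paper or its released infrastructure.

        import math

        def device_failure_probability(sigma_vth_mv, guardband_mv):
            """P(a device's threshold voltage falls outside the guardband),
            modeling Vth variation as a zero-mean normal distribution."""
            z = guardband_mv / sigma_vth_mv
            return math.erfc(z / math.sqrt(2))    # two-sided normal tail

        def working_chip_fraction(p_fail, n_devices):
            """Fraction of chips with zero failed devices, assuming independence."""
            return (1.0 - p_fail) ** n_devices

        if __name__ == "__main__":
            # Hypothetical sigma(Vth) values that grow as the node shrinks.
            for node, sigma_mv in [("45nm", 20), ("22nm", 35), ("12nm", 55)]:
                p = device_failure_probability(sigma_mv, guardband_mv=150)
                print(node, f"p_fail={p:.2e}",
                      f"working fraction (1M devices)={working_chip_fraction(p, 10**6):.3f}")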

    Fetch Halting on Critical Load Misses

    No full text
    As the performance gap between processors and memory systems increases, the CPU spends more time stalled waiting for data from main memory. Critical long-latency instructions, such as loads that miss to main memory and floating-point arithmetic operations, are primarily responsible for these stalls. We present a technique, Fetch Halting, that suspends instruction fetching when the processor is stalled by a critical long-latency instruction. This enables us to save power in one of the primary sources of power dissipation, the issue logic. By reducing occupancy rates in the issue queue and reorder buffer, we can disable a large number of unused queue entries and save power. In order to characterize critical instructions, our approach combines software profiling and hardware monitoring techniques. Statistical profiling information obtained from sample runs is used to identify critical instructions, while hardware cache-miss prediction is used to monitor these instructions. We show that, on average, Fetch Halting can reduce issue queue and reorder buffer occupancy rates by 17.2% and 23.4% respectively, with an average performance loss of only 4.6%.
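
    In outline, the mechanism is simple: profiling marks certain load PCs as critical, a hardware predictor watches those loads at runtime, and fetch is suspended while a predicted miss is outstanding so the issue queue and reorder buffer drain and their idle entries can be gated. The toy in-order model below is only meant to make that control flow concrete; the trace format, latencies, and names are assumptions, not the paper's simulator.

        def simulate_fetch_halting(trace, critical_pcs, miss_latency=200):
            """Toy in-order model of Fetch Halting (one trace entry per cycle).

            trace        : list of (pc, is_load, hits_cache) tuples
            critical_pcs : set of load PCs marked critical by offline profiling
            Fetch is suspended while a critical load miss is outstanding, so the
            issue queue and reorder buffer drain and idle entries can be gated.
            """
            fetch_stalled_until = 0
            fetched = 0
            for cycle, (pc, is_load, hits_cache) in enumerate(trace):
                if cycle < fetch_stalled_until:
                    continue                      # fetch halted: no new instructions
                fetched += 1
                if is_load and not hits_cache and pc in critical_pcs:
                    # The hardware miss predictor fires on a profiled-critical
                    # load; halt fetch until the main-memory access completes.
                    fetch_stalled_until = cycle + miss_latency
            return fetched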

    Spatial hardware implementation for sparse graph algorithms in GraphStep

    No full text
    How do we develop programs that are easy to express, easy to reason about, and able to achieve high performance on massively parallel machines? To address this problem, we introduce GraphStep, a domain-specific compute model that captures algorithms that act on static, irregular, sparse graphs. In GraphStep, algorithms are expressed directly without requiring the programmer to explicitly manage parallel synchronization, operation ordering, placement, or scheduling details. Problems in the sparse graph domain are usually highly concurrent and communicate along graph edges. Exposing concurrency and communication structure allows scheduling of parallel operations and management of communication that is necessary for performance on a spatial computer. We study the performance of a semantic network application, a shortest-path application, and a max-flow/min-cut application. We introduce a language syntax for GraphStep applications. The total speedup over sequential versions of the applications studied ranges from a factor of 19 to a factor of 15,000. Spatially-aware graph optimizations (e.g., node decomposition, placement and route scheduling) delivered speedups from 3 to 30 times over a spatially-oblivious mapping.
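
    The paper introduces its own language syntax for GraphStep; the snippet below is not that syntax, just a sequential Python caricature of the step semantics the abstract describes: each node fires at most one message per step, edge operators forward messages, and fan-in at each destination is combined with a commutative reduction, so the programmer never orders operations or locks shared state.

        def graph_step(nodes, edges, inbox, node_update, edge_op, reduce_op):
            """One GraphStep-style superstep, simulated sequentially.

            nodes       : dict node_id -> node state (updated in place)
            edges       : list of (src, dst, edge_state) triples
            inbox       : dict node_id -> reduced input from the previous step
            node_update : (node_state, reduced_input) -> (new_state, message or None)
            edge_op     : (edge_state, message) -> value delivered to dst
            reduce_op   : associative, commutative combiner for fan-in values
            Returns the inbox for the next step.
            """
            messages = {}
            for nid in nodes:                     # every node fires at most once
                nodes[nid], msg = node_update(nodes[nid], inbox.get(nid))
                if msg is not None:
                    messages[nid] = msg
            next_inbox = {}
            for src, dst, estate in edges:        # edge operators run independently
                if src in messages:
                    val = edge_op(estate, messages[src])
                    next_inbox[dst] = (val if dst not in next_inbox
                                       else reduce_op(next_inbox[dst], val))
            return next_inbox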

    GROK-LAB: generating real on-chip knowledge for intra-cluster delays using timing extraction

    No full text
    Timing Extraction identifies the delay of fine-grained components within an FPGA. From these computed delays, the delay of any path can be calculated. Moreover, a comparison of the fine-grained delays allows a detailed understanding of the amount and type of process variation that exists in the FPGA. To obtain these delays, Timing Extraction measures, using only resources already available in the FPGA, the delay of a small subset of the total paths in the FPGA. We apply Timing Extraction to the Logic Array Block (LAB) on an Altera Cyclone III FPGA to obtain a view of the delay down to near individual LUT granularity, characterizing components with delays on the order of a few hundred picoseconds with a resolution of ±3.2 ps. This information reveals that the 65 nm process used has, on average, random variation of σ/μ = 4.0% with components having an average maximum spread of 83 ps. Timing Extraction also shows that as VDD decreases from 1.2 V to 0.9 V in a Cyclone IV 60 nm FPGA, paths slow down and variation increases from σ/μ = 4.3% to σ/μ = 5.8%, a clear indication that lowering VDD magnifies the impact of random variation.
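
    The measurement side of GROK-LAB uses on-chip launch/capture resources and a carefully chosen subset of paths; the algebraic side, recovering fine-grained component delays from measured path delays, can be pictured as a least-squares solve of "path delay = sum of its components' delays." The sketch below illustrates only that algebraic step, with illustrative names and no treatment of measurement error.

        import numpy as np

        def extract_component_delays(paths, measured, n_components):
            """Recover per-component delays from measured path delays.

            paths        : list of paths, each a list of component indices
            measured     : list of measured delays, one per path
            n_components : total number of fine-grained components
            Each measured path delay is modeled as the sum of its components'
            delays, i.e. the linear system A @ d = m, solved in least squares.
            """
            A = np.zeros((len(paths), n_components))
            for row, path in enumerate(paths):
                for comp in path:
                    A[row, comp] = 1.0
            d, *_ = np.linalg.lstsq(A, np.asarray(measured, dtype=float), rcond=None)
            return d

        def path_delay(d, path):
            """With component delays in hand, any path's delay is just a sum."""
            return float(sum(d[c] for c in path))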